DOCUMENT RESUME

ED 206 6110 TM 810 528

AUTHOR        Nagy, Philip; Jarchow, Elaine McNally
TITLE         Estimating Variance Components of Essay Ratings in a
              Complex Design.
PUB DATE      Apr 81
NOTE          58p.; Paper presented at the Annual Meeting of the
              American Educational Research Association (65th, Los
              Angeles, CA, April 13-17, 1981).
EDRS PRICE    MF01/PC03 Plus Postage.
DESCRIPTORS   *Analysis of Variance; Essay Tests; Foreign
              Countries; Grade 10; High Schools; Models; *Scoring;
              *Test Reliability; *Writing (Composition); *Writing
              Evaluation; Writing Skills
IDENTIFIERS   Generalizability Theory; Iowa; Newfoundland

ABSTRACT

This paper discusses variables influencing written
composition quality and how they can best be controlled to improve
the reliability of assessment of writing ability. The study aimed to
concentrate on the effects of different imposed time limits on
student performance in written composition at grade 10 level using
methods of generalizability theory; however, as the study progresses
the major focus becomes methodological issues. Research studies and
literature on this subject are reviewed before the design of the
study is described. The sample of 320 students from Iowa and
Newfoundland schools who were divided into eight groups (each group
being assigned eight judges, who would assess the students'
performance on two written topics under different time constraints)
is presented. Scoring procedures are outlined. Discussed are the
analytic methods of both Model 1 (a seven way analysis of variance
model) and its limitations, and Model 2, which was adopted to analyze
the variance for a confounded model incorporating elements of topic,
time limit, and group. The results and limitations of the study are
documented in the concluding section. Data analysis statistics are
appended. (AEP)



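***********************************************************************
*  Reproductions supplied by EDRS are the best that can be made       *
*  from the original document.                                        *
***********************************************************************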

U.S. DEPARTMENT OF EDUCATION
NATIONAL INSTITUTE OF EDUCATION
EDUCATIONAL RESOURCES INFORMATION
CENTER (ERIC)

This document has been reproduced as
received from the person or organization
originating it.

Minor changes have been made to improve
reproduction quality.

Points of view or opinions stated in this document
do not necessarily represent official NIE
position or policy.



ESTIMATING VARIANCE COMPONENTS OF ESSAY 
RATINGS IN A COMPLEX DESIGN 



Philip Nagy 
Memorial University of Newfoundland 

and 



Elaine McNally Jarchow 
Iowa State University 



"PERMISSION TO REPRODUCE THIS
MATERIAL HAS BEEN GRANTED BY



TO THE EDUCATIONAL RESOURCES
INFORMATION CENTER (ERIC)."



A Paper Presented to 
the Annual Meeting, American Educational Research Association 
Los Angeles, California 
1981 



This research was supported by an Iowa State University Research Grant 
and Memorial University's Institute for Educational Research and 
Development. Opinions expressed are those of the authors. 






INTRODUCTION 

This is a report on a large scale study of factors influencing grades 
assigned to written composition at the grade ten level, using the methods 
of generalizability theory (Cronbach et al., 1972). As originally conceived,
the purpose of the study was to investigate the effects of differing
time limits on student writing performance, along with differences in subjects
and judges and other related factors. During the evolution of the investigation,
it became clear that the focus of the report would be on methodological
issues of generalizability theory, rather than on substantive issues of
essay writing and grading. These dual purposes, then, are interwoven 
throughout the report. 

The assessment of writing ability has been a long standing concern of 
teachers and researchers in language education. Coffman, in a review of the 
field, found that "few of the important questions have been answered in any 
final way." (1971, p. 296). Major overlapping themes of research can be 
discerned in the literature. Bereiter and his colleagues and students 
(e.g., Bereiter et al., 1979), among others, are concerned with the process
of writing, with a view to understanding the skills necessary to produce good
written work, and to improving the teaching of written composition in the
classroom. Godshalk et al. (1966) focus on the indirect assessment of
writing ability by devices such as multiple-choice tests of components of
the writing process. The major substantive concern of this paper is not
on the process of writing, but on that of grading of writing products. While
the direct motivation is the understanding and improvement of the grading
of composition, progress in clarifying the various influences on grading
should be of interest to those whose main concerns are other aspects of
written composition.

Thanks are due to Tom Maguire, Bill Spain, and others for discussions of
the analysis problems. Remaining errors are the fault of the first author.

One main purpose of this study is to investigate the relationship 
between writing performance under two conditions, "at home" and "exam".
The goal of teaching composition is to produce students who are able to
write for personal purposes in later life. Such "personal purposes", of
course, will often include later academic pursuits. Classroom assessment
of student ability is usually done by grading work produced under two sets
of circumstances: one, over a period of several days under "at home" or
"assignment" conditions, and two, within a specific time limit, under "exam"
conditions. The former set of conditions is more representative of the end
goal of instruction, and the assumption has traditionally been that writing
ability under exam conditions is an accurate reflection of the same ability 
under more relaxed conditions. This assumption has not been tested in any 
systematic manner. 

The methods of generalizability theory, involving the calculation of
variance component estimates and the accompanying generalizability coefficients,
have been given their primary impetus by the work of Cronbach et al. (1972).
Cardinet et al. (1976) and Rentz (1980) have produced considerably simpler
"rules-of-thumb" views of the methods, making them more accessible to
researchers. 

The analysis began using a complex, nested design (Model 1), which
could be viewed either as incomplete, or as a Latin Square with confounding
of some of the factors of major interest. Dissatisfaction with the
confounding problem led to a reconceptualization of the design in a manner
which effectively removed the confounding. This second design, dubbed Model
2, allowed estimation of the variance components of interest. Both
conceptualizations of the design contained several replications of an
experimental unit. Comparisons across these replications allowed examination
of the stability of variance component estimates. Our results support
the concerns expressed by Smith (1978).

The structure of the nesting in the design limits the types of 
calculations of generalizability coefficients which can be made. Thus,
methodologically, the interest of the report is in the two analysis of 
variance models used, the solving of the confounding problem, and in the 
data on variance component estimate stability. 

BACKGROUND 

Among the important questions to which researchers in the assessment 
of writing ability have sought answers are:

1. Can writing ability be assessed through indirect (amenable to 
machine scoring) methods? 

2. Is it more efficient and reliable to score essays analytically 
or holistically? 

3. What are the variables influencing written composition quality? 
How are they best controlled to improve reliability of estimates 
of writing ability? 

This last question is of direct concern to this report, but accurate
estimation of writing quality is a major tool in research on the first two.
Any discussion of related research cannot entirely separate the three
questions.

Godshalk (1966) undertook a major study of the relationship between 
direct and indirect assessment of writing ability at the high school level. 
Although his main concern is not central here, as part of the investigation, 
he sought reliability estimates for scores assigned to his direct measures. 
Without reliable direct estimates of writing ability, his investigation of
their relationship to indirect methods would be difficult. To combat the
well known inconsistency of essay grades, Godshalk employed a technique of
obtaining multiple writing samples from each student, and having these
graded by several graders. Godshalk commented, "... the unreliability
of essay tests came from two major sources: the differences in quality of
student writing from one topic to another, and the differences among readers
in what they consider the characteristics of good writing."

In his study, Godshalk used five topics per student. The essays 
were timed, some forty minutes and some twenty minutes, and contained both 
descriptive and expository topics. The readers, with no experience in reading 
brief essays produced by high school students, were asked to make holistic 
judgments (3,2 or 1). Each essay was scored by 5 readers. Some twenty-five 
readers rated 646 papers and achieved a reliability estimate of 0.921 for 
scores on the samples and 0.841 for estimates of writing ability. Godshalk's 
calculational methods were based on analysis of variance (see Ebel, 1951),
but did not follow the methods of generalizability theory directly. Reanalysis
of the data presented in Table 1 (page 12) of Godshalk produces a generalizability
coefficient of student writing ability, for five topics crossed with
five readers, both random, of 0.912. Two broad generalizations from this
study are: "The reliability of essay scores is primarily a function of the
number of different essays and the number of different readings included."
"The most efficient predictor of a reliable direct measure of writing
ability is one which includes essay questions or interlinear exercises in
combination with objective questions." This second conclusion, while
central to Godshalk's objective, is incidental here. 



Many studies (e.g., Finlayson, 1951; Vernon and Millican, 1954) have
considered essay scoring procedures primarily in terms of their reliability.
In a major review, Coffman (1971) concluded that "a) different raters
tend to assign different grades to the same paper; b) a single rater tends
to assign different grades to the same paper on different occasions; and
c) the differences tend to increase as the essay question permits greater
freedom of response" (p. 277). According to Coffman, raters differ in a
number of ways: a) severity; b) extent to which they distribute grades
throughout the score scales; and c) understanding of relative values
assigned papers. Reliability becomes greater when a small group whose
backgrounds of instruction are homogeneous act as scorers.

Bond and Tamor (1979) report that while differences in judges have 
been well documented, little is known about the variability of student 
performance over different occasions. In the present context, little is
known about the relationship between performance with and without the 
pressure of time. 

A concern over scoring has led researchers to distinguish
between holistic and analytic scoring. Prior to this distinction, scorers
were often provided a list of do's and don'ts. For example, Ahman and Glock
(1975) suggest "If the spelling, penmanship, grammar, and writing style of
the responses are to be scored, it should be done independently. If this is
not possible, score them a second time yourself, doing so without knowledge
of your first judgments." (pp. 139-141).

More recently, though, the holistic scoring method has been accepted
by both practitioners and researchers. The holistic method asks the rater to
read for a judgment of the whole product and assign a rating on that basis.
The analytic method might set forth a series of criteria similar to those
identified by McNally (1977) and suggest weightings. In similar fashion,
Diederich and Link (1967) proposed five analytic characteristics: ideas,
form, flavor, mechanics, and wording.

As early as 1950, Coward suggested that it is possible to get two 
global ratings in the time it takes to do one analytic score and the 
reliability is the same. "There is some evidence, for example, that for
content examinations raters are able to distinguish more than 9 quality
levels without an appreciable increase in the time required for making
the ratings." (Coffman, 1971, p. 295).

Powills, Bowers, and Conlan (1979) describe an Indianapolis Writer's
Clinic for teachers of seventh and eighth graders. The teachers, volunteers
for the clinic, participated in a holistic scoring experiment of two
controlled writing assignments. Although details of their methodology
can be criticised, they concluded that the holistic method was an efficient
and reliable method of scoring papers. Reading at the rate of 64 papers per
hour, readers achieved an average reliability (Cronbach's alpha) on the sum
of three independent ratings of about 0.80. Conlan (1980) reported scorer
frustration with the time needed for analytic scoring.

Although the major sources of variation among scores on written
composition are inter- and intra-student and judge differences, some 
investigations have demonstrated the effect of other variables on score 
variance. 

In the present study, papers were typed to remove the effects of 
handwriting quality. The variable of handwriting and its effect on raters 
was addressed by Markham (1976). She asked the question, what is the effect
of quality handwriting on elementary school teachers' evaluation of
elementary school children's written work? After analysis of forty-five
teachers' and thirty-six student teachers' evaluations, she concluded that
papers with better handwriting consistently received higher scores than
did those with poor handwriting regardless of quality of content. This
conclusion was also drawn at the secondary level by Chase (1968),
Briggs (1970), and Soloff (1973).

Chase (1979) continued this consideration of the handwriting variable 
by examining the effect of achievement expectancy on scores given essay 
tests which varied widely in quality of handwriting. He concluded that 
handwriting alone was not the most significant variable in scoring. In 
fact, when high expectancy was coupled with poor handwriting, expectancy
appeared to negate the effect of poor handwriting. 

Another important variable in studying writing is that of mode of 
discourse. Cooper (1977) contends that different writing attributes 
characterize varied modes of discourse. For example, developing an 
argument in writing takes half as much time in words per minute as 
reporting a personal experience. 

Crowhurst (1978) concluded that argument outranks both narration and
description in syntactic complexity for grades six and ten. Rosen (1969)
found that fifteen and sixteen year olds produced longer T-units in
referential writing than they did in expressive or personal writing.
Coffman (1971), in discussing the nature of the essay test, also considered
the nature of task complexity when he postulated, "The more complicated
the question, the more time required to compose and record an answer." 
(p. 280). 

Shale (1978), examining different modes of discourse, discusses 
criteria which have been identified for analytic scoring. He identified 
criteria common to both exposition and description. In his study 168
twelfth grade students wrote on two topics at home (one descriptive
and one expository). Four well trained markers scored the essays. The 
findings prompted the researcher to conclude that there are criteria 
that characterize one type of writing and not another. Many of the same 
criteria are used to evaluate expository and descriptive writing but they 
are often perceived differently by raters. 

Generalizability theory represents a powerful tool for the investigation
of phenomena which are influenced by several sources of variance.
Scores on written composition are an excellent example. The essence of
generalizability theory is two-fold: first, by working through equations
stating the theoretical expectations for the values of mean squares (EMS)
in analysis of variance, it is possible to calculate values for each
variance source in the EMS; second, by assigning variance component
estimates (VCE's) to either true score or error score as appropriate,
estimates of the generalizability of measures across various combinations
of conditions can be made. In Cronbach's (1972) terminology, a G
(generalizability) study is done to obtain the VCE's. Using these values,
a D (decision) study can be planned to maximize generalizability across
the domain of substantive interest. The present study was intended as a
G study, done more for demonstration of a method than with any particular
D study in mind. 
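The two steps just described can be illustrated on a much simpler layout than the one used in this study. The following is a minimal sketch only, assuming a fully crossed students-by-judges design with one score per cell; the function names and the toy data are hypothetical and are not drawn from the study. It solves the expected mean square equations for the variance component estimates, then assigns the student component to true score and the residual to error.

```python
def variance_components(scores):
    """Estimate variance components for a crossed students x judges design.

    scores[s][j] is the grade given to student s by judge j, with one
    score per cell (an assumed layout, simpler than the paper's design).
    """
    n_s = len(scores)
    n_j = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_s * n_j)
    student_means = [sum(row) / n_j for row in scores]
    judge_means = [sum(scores[s][j] for s in range(n_s)) / n_s
                   for j in range(n_j)]

    # Sums of squares for the two main effects and the residual.
    ss_s = n_j * sum((m - grand) ** 2 for m in student_means)
    ss_j = n_s * sum((m - grand) ** 2 for m in judge_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_sj = ss_total - ss_s - ss_j

    ms_s = ss_s / (n_s - 1)
    ms_j = ss_j / (n_j - 1)
    ms_sj = ss_sj / ((n_s - 1) * (n_j - 1))

    # Expected mean squares (both facets random):
    #   E(MS_s)  = var_sj + n_j * var_s
    #   E(MS_j)  = var_sj + n_s * var_j
    #   E(MS_sj) = var_sj
    # Solving these equations gives the estimates below. Estimates can
    # come out negative in small samples; no correction is applied here.
    return {
        "s": (ms_s - ms_sj) / n_j,
        "j": (ms_j - ms_sj) / n_s,
        "sj,e": ms_sj,
    }


def g_coefficient(vc, n_judges):
    """Generalizability of a student's mean over n_judges judges.

    The student component is treated as true score; the residual,
    averaged over judges, is treated as error.
    """
    return vc["s"] / (vc["s"] + vc["sj,e"] / n_judges)


# Toy data: three students, two judges (illustrative values only).
toy = [[2, 4], [4, 4], [6, 10]]
vc = variance_components(toy)
print(vc, g_coefficient(vc, n_judges=2))
```

Calling g_coefficient with different values of n_judges is the D study step in miniature: the same estimates are reused to project generalizability under other measurement plans.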

While the theoretical underpinnings of the ANOVA calculations done 
below are available from many sources (e.g., Winer, 1971), reliance was
placed on the Millman and Glass (1967) system and terminology. Rentz
(1980) has provided guidelines for separation of VCE's into true and error
scores for a complex, nested design. Analysis of the confounding problems,
although done independently, is similar to that of Linhart (1979).

There have been a few studies which have applied the techniques of
generalizability theory to essay grading. Godshalk et al. (1966) found
reliabilities of summated ratings (five readers) of individual student
products from the same student (total of 25 grades), 0.92. He also found
that there was little benefit to be gained by increasing the number of
readers from four to five. The most important result to be drawn from
this study, which focused on the relationship of indirect to direct
measures of writing ability, is that considerably more reliability can be
gained by increasing the number of writing samples drawn from students
than by increasing the number of readings of each writing sample.

Steele (1979) reports generalizability coefficients in agreement
with those of Godshalk. Starting with a base reliability of 0.58 for
two samples read by two scorers, reliability could be increased to 0.65
by adding a third sample, but to only 0.62 by adding a third scorer.
Steele's study used college students, but it is not clear from the report
how many subjects were the basis for the above reported values.
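The pattern in these figures, where an added sample helps more than an added scorer, is what a D study projection would show when topic-related error dominates. As a sketch only, assuming a fully crossed design of students (s), topics (t) and judges (j) with all facets random, which is simpler than the nested design of the present study, the generalizability coefficient for a student's mean over n_t samples and n_j scorers can be written as:

```latex
E\rho^2 \;=\;
\frac{\sigma^2_{s}}
     {\sigma^2_{s}
      + \dfrac{\sigma^2_{st}}{n_t}
      + \dfrac{\sigma^2_{sj}}{n_j}
      + \dfrac{\sigma^2_{stj,e}}{n_t\, n_j}}
```

When the student-by-topic component is larger than the student-by-judge component, raising n_t shrinks the error term faster than raising n_j. The symbols are illustrative and are not taken from Steele's report.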

Rentz (1980), in a primarily methodological study, reports generalizability
coefficients for three judges scoring two samples per student
using analytic scoring methods. Given the differences between his study
and those of Godshalk and Steele, it is difficult to draw direct comparisons.
However, his reported coefficients are in the range 0.85 and above.

Following the recommendations of Coffman (1972), the present study
uses holistic ratings on a more detailed rating scale than those reported
above. It includes estimates of the effects on essay grades of topics,
time limits, the countries of both students and judges, as well as the
influences of student and judge variations, as previously investigated by
others. Among the most noteworthy omissions in the list of investigated
variables is the influence of scoring context, in the sense of whether a
paper is preceded in the scoring pile by high or low quality products.
For a discussion of this variable, see Hughes et al. (1980).

PURPOSE

This study focuses on the effects of imposed time limits on student
performance in written composition. Students wrote on two topics under
two time limits, but no student wrote any one topic under both time limits.
If we consider the students who wrote Topic 1 under Time Limit 1 and Topic
2 under Time Limit 2 as Group 1, and those who wrote Topic 2 under Time
Limit 1 and Topic 1 under Time Limit 2 as Group 2, the sources of confounding
in the design can be seen. Topic, Time Limit and Group are confounded,
in a manner similar to the problem encountered by Linhart (1979). 
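The confounding can be made concrete with contrast codes. The sketch below is illustrative only: the +1/-1 coding and the helper names are assumptions introduced here, not part of the original analysis. It enumerates the four Topic-by-Limit cells, assigns each to a Group as defined above, and checks that each factor's contrast equals the product of the other two.

```python
from itertools import product

# Assumed contrast coding: level 1 -> +1, level 2 -> -1.
CODE = {1: +1, 2: -1}


def group_of(topic, limit):
    """Group 1 wrote Topic 1 under Limit 1 and Topic 2 under Limit 2;
    Group 2 wrote the other two Topic-Limit combinations."""
    return 1 if topic == limit else 2


# One row per Topic x Limit cell, holding the coded T, L and G contrasts.
cells = [
    {"T": CODE[t], "L": CODE[l], "G": CODE[group_of(t, l)]}
    for t, l in product((1, 2), repeat=2)
]


def confounded_with_interaction(x, y, z):
    """True if contrast x equals the y-by-z interaction contrast in every cell."""
    return all(c[x] == c[y] * c[z] for c in cells)


for x, y, z in (("G", "T", "L"), ("T", "G", "L"), ("L", "G", "T")):
    print(x, "is confounded with", y + z, ":",
          confounded_with_interaction(x, y, z))
```

Because the three contrasts are tied together in this way, no analysis of these four cells alone can separate a Group effect from a Topic-by-Limit interaction, and likewise for the other two pairings.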

Keeping in mind the shift from an initial dual interest in both G 
studies and essay grading to more of an emphasis on methodological problems, 
the purposes of the investigation, in chronological order rather than order
of importance are: 

1. To produce additive components for a complex analysis of
variance, first for a confounded model (Model 1);

2. To discuss the assumptions necessary to untangle the confounding,
and through preliminary data analysis, to show the limitations of
this approach;

3. To produce additive components for an unconfounded analysis,
Model 2;

4. To calculate variance component estimates under both Models;

5. To relate the analysis of variance results to issues in essay
grading;

6. To investigate the instability of reported variance component
estimates;

7. To calculate example generalizability coefficients.

The order of sections following in this report requires some explanation.
The complexity of the design necessitates description in order to
make sense of the sampling procedure. Then, the analytic methods of Model
1 must be fully discussed so that their limitations may be seen, and the 
necessity for abandonment of this procedure made clear before Model 2 can 
be described. Results and discussion follow description of Model 2. 

DESIGN

The design as originally conceived was a balanced eight-way analysis 
of variance, with both nested and crossed factors, and three of the eight
factors confounded. Because of the confounding, the design is presented 
below as a seven-way ANOVA (Model 1) , with a discussion of the assumptions 
necessary to obtain estimates of the mean square for the eighth factor 
and its interactions. The lack of empirical support for these assumptions 
is part of the reason for the introduction of Model 2. 

Grade 10 students were asked to write two short compositions on
different topics. Topic 1 was descriptive: "Describe trying to get to
sleep with a mosquito buzzing", and Topic 2 was an argument: "Argue for
or against the statement 'Children over the age of fourteen should have as
much say as parents in reaching decisions which affect the entire family'".
The first essay written was allowed a time limit of 30 minutes, while the
second was allowed a class period, plus overnight for revision. Topics
and Time Limits were crossed. Students operated under the assumption
that the grades would "count", and participating teachers were encouraged
to use the grades provided by the judges (almost all chose to do so, with
some reserving the right to remark the papers themselves). The shorter
time limit will be referred to below as Limit 1 and the overnight time
limit as Limit 2. Students who wrote Topic 1 under Limit 1 and Topic 2
under Limit 2 will be called Group 1, while those writing Topic 1 under
Limit 2 and Topic 2 under Limit 1 will be called Group 2. Students,
topics, time limits and groups are symbolized S, T, L and G respectively.
Students were balanced by country of origin: Country 1 being Canada (in
fact, the Province of Newfoundland only) and Country 2 the United States
(in fact, the State of Iowa only).

Each written composition was typed "as is", to eliminate problems due
to handwriting quality, and graded by four judges (J), two from each
country. (In further discussion, the country of student will be symbolized
C, and that of the judge K.) Judges were nested within topics.

In order to maintain complete crossing between judges and students,
and at the same time require only a reasonable marking load from each
judge, another factor was introduced, duplications (D). The more traditional
terms, 'replications' and 'experiments', are both used in Model 2, with
different meanings. The use of 'duplications' reduces confusion. (See
Smith, 1978.) Figure 1 displays the design within one duplication. Thus,
the eight factors are Topics, Limits, Students, Judges, Country of Judge,
Country of Student, Groups (which can be thought of as order of writing)
and Duplications.

[Insert Figure 1 about here]






Ten students are nested within each group and country (C) , and two 
groups within each country. Two judges are nested within topic and 
country (K) . Each row of the design contains the eight scores assigned 
to each student, and each column represents the workload, forty papers, of 
each judge. Thus, there are forty students and eight judges in each 
duplication. In total, there are eight duplications, so that the entire 
sample consists of 640 papers written by 320 students, given 2560 grades
by 64 judges.
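The totals above follow directly from the nesting counts. As a quick arithmetic check (the constant names are introduced here for illustration only):

```python
# Nesting counts as stated in the text.
STUDENTS_PER_GROUP_CELL = 10    # students nested within group and country
GROUPS_PER_COUNTRY = 2
STUDENT_COUNTRIES = 2           # C: Newfoundland, Iowa
TOPICS = 2                      # each student writes both topics
JUDGE_COUNTRIES = 2             # K
JUDGES_PER_TOPIC_COUNTRY = 2    # judges nested within topic and country
DUPLICATIONS = 8

students_per_duplication = (
    STUDENTS_PER_GROUP_CELL * GROUPS_PER_COUNTRY * STUDENT_COUNTRIES
)
judges_per_duplication = JUDGES_PER_TOPIC_COUNTRY * TOPICS * JUDGE_COUNTRIES
judges_per_paper = JUDGES_PER_TOPIC_COUNTRY * JUDGE_COUNTRIES

design_totals = {
    "students": students_per_duplication * DUPLICATIONS,
    "papers": students_per_duplication * TOPICS * DUPLICATIONS,
    "judges": judges_per_duplication * DUPLICATIONS,
    "grades": students_per_duplication * TOPICS * judges_per_paper * DUPLICATIONS,
    "scores_per_student": TOPICS * judges_per_paper,
    # Each judge marks one topic for every student in the duplication.
    "papers_per_judge": students_per_duplication,
}
print(design_totals)
```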

Since no student was asked to write a given topic under both time 
limits, the design could have been conceived as balanced and incomplete. 
Winer (1971, pp. 711-713) suggests, however, that it be considered as a
repeated measures Latin Square, eliminating the incompleteness, but 
not the confounding. This problem will be taken up again after 
discussion of other issues. 

SAMPLE AND DATA COLLECTION 

Selection of students within schools was random, but selection of
schools was not. Sample schools were selected in Newfoundland first. As
the population of the Province is largely rural, an attempt was made to
solicit cooperation from schools in the larger centres, in order to make
the samples from the two locations more comparable. Using available
standardized test data from the areas in which schools were located, an
attempt was made to locate Iowa schools which were approximately matched
on test scores, school size, and community size. Thus, the samples are
cross-sectional in nature, tending more towards rural than urban, and on
balance, slightly below national norms. We make no claim that the samples
are "matched".

Schools were assigned, in pairs, one from each location, to Group 1 
or Group 2 treatment. In Figure 1, Groups 3 and 4 received treatment of 
Groups 1 and 2 respectively. The different subscripts are intended to 
emphasize that they are different students. Of the 22 schools used, 
approximately half were large enough to have more than one grade ten class.
In these cases, a middle ability academic class was chosen. In four
schools, two classes were used. 

The data were collected and instructions given to the students by 
either school or board office personnel, or a research assistant. Written 
instructions were given to the administrators of the writing samples, and 
a form provided for them to report any difficulties. Four classes were 
lost, one in which the administrator allowed use of a dictionary, another 
in which the time limit was not adhered to, and two whose school administration
changed their minds about participation after the first writing.
As there was considerable oversampling in the data gathering phase, these
losses were not a problem. 

The written samples were checked, and any which gave away the author's
location were eliminated from the sample. We were prepared also to
eliminate those which, despite instructions, were too long to have been 
written in thirty minutes. As it happened, the longest selection, almost 
500 words, was written in the short time limit. Thus, it was judged not 
necessary to eliminate any for this reason. 

Eighty students (ten for each duplication) were needed to fill each
of the country-group cells for the eight duplications. After removal of
data as described above, and removal of single, unpaired samples, from
85 to 140 students remained for each of the cells. The design cells were
filled by random selection from the data pool.

Sixty-four judges, 32 from each country, were selected through
personal recommendation of school board personnel and acquaintance with
the investigators. All were qualified, experienced teachers of grade ten
English. Approximately 80 teachers were approached to fill the 64 vacancies.
Once they had agreed, no one backed out. The teachers were paid an
honorarium for their cooperation. 

SCORING

Judges were given no specific instructions for scoring. In keeping 
with the intention of generalizing as much as possible to everyday class- 
room practice, they were asked to grade on the scale F, D-, D, ... A, A+ 
according to the implicit standards they had developed in their experience
with grade ten students. Judges were given the topic title, but no 
information on time limits or the origins of the students in the sample. 
Papers were presented in order, according to the student numbers, 1-40, of 
Figure 1, with coding disguised. Judges were not instructed on order of 
marking, and most of the judges mailed back the papers in a different 
order than the one in which they were sent out. 

Lack of information on the time limits made the situation slightly
artificial, but given the circumstances, this seemed a reasonable compromise.
For analysis, scores were converted to a scale of 1-13, ranging
from F to A+. 
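The conversion can be sketched as a simple lookup. The text gives only the endpoints of the letter scale ("F, D-, D, ... A, A+") and the 1-13 range; the intermediate plus and minus grades listed below are an assumption, chosen because they yield exactly thirteen ordered categories. The names are illustrative only.

```python
# Assumed ordering of the thirteen letter grades, from lowest to highest.
# Only F, D-, D, A and A+ are named in the text; the rest are inferred.
GRADES = ["F", "D-", "D", "D+", "C-", "C", "C+",
          "B-", "B", "B+", "A-", "A", "A+"]

# Map each letter grade to its numeric score, 1 (F) through 13 (A+).
GRADE_TO_SCORE = {grade: score for score, grade in enumerate(GRADES, start=1)}


def to_score(grade):
    """Convert a letter grade such as 'B+' to its 1-13 numeric score."""
    return GRADE_TO_SCORE[grade.strip().upper()]


print([to_score(g) for g in ("F", "C", "A+")])
```

Treating the thirteen categories as equally spaced is itself an assumption of any analysis of variance carried out on the converted scores.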




ANALYSIS (MODEL 1) 

The design as depicted in Figure 1 may be conceptualized as a Latin 
Squares design with repeated measures (Winer, 1971, page 711). Alternatively,
the designation of Groups may be eliminated, and the design viewed as a 
balanced incomplete design, with the incompleteness resulting from the fact 
that no student approached the same topic under both time limits. It is 
this impossibility which confounds the factors, and makes the analysis less 
than straightforward. Temporarily ignoring factor L, Figure 2 gives 
additive components of the design for the other seven factors. Millman 
and Glass' (1967) terminology has been used. To simplify depiction of the 
expected mean squares of the sources, the usual "Within" and "Between" 
distinction has not been used. Also, in order to simplify and clarify the 
introduction of the time limit factor into the design, a complete model is 
presented. That is, all non-interpretable higher order interactions have 
been left in the model, and have not been pooled. While this limits the
power of the design to find significant effects, such findings are not of 
major concern. For consistency and ease of reference, names of sources are 
arranged alphabetically, on each side of the colon. The only exception is 
in Model 1, where D (duplications) is treated as if it were in the position 
of R (replications) in the alphabet. This simplifies comparison of Models 
1 and 2. 

[Insert Figures 2 & 3 about here]

By ignoring all but the three factors involved, Figure 3 shows the
relationship between G, T and L. Each is confounded with the interaction 
of the other two. The Group variable is not, of course, of any substantive 
interest, so it would be preferable if interpretations of the sources of
variation could be converted from "Group" to "Time Limit" terminology. The
additive model of Figure 2 is only one possibility for the analysis. As a
logical alternative, Topics could have been ignored, and Limits used in
designating the components. Had this been the case, our interpretation
problem would involve trying to eliminate Groups in favor of Topics, rather
than, as at present, in favor of Limits.

Confounding means that the factors cannot be distinguished from each
other, either conceptually or analytically. While it is clear that Linhart
(1978) understands this, her use of the term 'aliases' to describe the
relationship between, for example, G and LT, is potentially misleading. In
switching from one name to another, certain assumptions are made which can
be exposed, and to some extent, analyzed. The translation is complicated
by the nesting of the present design.

Figure 4 gives a breakdown of the terms which are confounded due to
nesting, within the five sources from Figure 2 which involve G on the left
of the colon. Applying the simple rule of eliminating G suggested by the
relationships depicted in Figure 3, we can "translate" each of these terms
to the expressions found on the right hand side of Figure 4. This process
produces a list of sources which range from main effects (L) to six-way
interactions (CJKLDT). Rational interpretation of a nested source, such
as G:CD, assumes that the confounded sources, GD, CG and CGD, are all zero.
Parallel reasoning applied to the sources on the right hand side of the
table suggests that higher order interactions be allowed to "defer" to
lower order interactions or main effects. This allows us to allocate variance
attributed to, for example, G:CD directly to LT, under the assumption that
LDT, CLT and CLDT are equal to zero.
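
The translation rule can be sketched in a few lines of Python. This is an
illustration only, not part of the original analysis. It assumes that G, L
and T each have two levels, so that G may be replaced by LT and any letter
that then appears twice cancels. The function name `translate` and the
ordering string `ORDER` are hypothetical helpers introduced here.

```python
# Factor order used in the paper's source names: alphabetical, with
# D (duplications) placed where R (replications) would fall.
ORDER = "CGJKLDST"


def translate(source: str) -> str:
    """Rewrite a crossed source containing G in terms of L and T.

    Assumption: G is aliased with the LT interaction of two two-level
    factors, so substituting LT for G and cancelling any letter that
    appears twice gives the translated name.
    """
    letters = set(source)
    if "G" in letters:
        letters.remove("G")
        # Symmetric difference: add L and T, cancelling any already present.
        letters ^= {"L", "T"}
    return "".join(f for f in ORDER if f in letters)


for src in ["G", "GK", "GT", "GKT", "CGKD", "CGJKDT"]:
    print(src, "->", translate(src))
```

Under these assumptions the sketch reproduces the pairings of Figure 4,
for example GT with L and GKT with KL.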

[Insert Figure 4 about here] 


The reasonableness of such assumptions can be verified, in part, by
further analysis. Table 1 presents some results for an investigation of the
assumptions discussed above. Temporarily assuming a completely crossed
five-way design, the data was analyzed as a C by K by L by D by T design.
This makes the false assumption that the groups nested within countries (C)
and duplications were, in fact, composed of the same people. In this
analysis, presented only for two sources drawn from Figure 4, G:CD and
GT:CD, we can verify that the sums of squares add as expected, but that the
assumption that higher order interactions are less than those of lower
order does not hold. In fact, the Group effect, G or LT, is, by chance,
precisely zero. The implications are: one, that interpretations of lower
order interactions and main effects are dubious at best; and two, Model 1
is only partly functional for answering the questions asked of it.

[Insert Table 1 about here] 

Large sample size and luck rather than planning made abandonment of 
Model 1 an option. Discussion of this model will be pursued for two 
reasons: one, the discussion should prove beneficial to those whose 
circumstances do not allow alternative analyses; and two, the problem is 
inherently interesting. 

Returning to Figure 2, five of the sources remain nested within G. 
In attempting to translate these, we would have to introduce complex 
terminology nesting students within LT interactions. Students are 
crossed with both L and T, so such a translation is misleading at best, 
and probably not permissible. Note, finally, that the confounding of T
with GL exists as well. 

The translation is an aid to understanding and interpretation only,
and can in no way affect the calculations which follow. Thus, use of
the Millman and Glass (1967) rules of thumb for EMS equations must incorporate
G rather than L, as must the Rentz (1980) rules for VCE's. As an aside,
note that straightforward application of the Millman and Glass rules to 
Linhart's analysis problem produces her summary table. 

Expected contributions to mean square values are summarized in Table 2.
This Table follows Millman and Glass (1967), and assumes that S, J and D
are random, while C, G, K and T are fixed. While it may seem that G should
be considered a random factor, it is being considered as confounded with
the interaction of two fixed factors, L and T. Rentz (1980) suggests that,
for calculation of VCE's, all factors may be considered random initially,
as generalizability coefficients can be calculated later under the assumption
that factors are either fixed or random. However, there is no possibility
of either K or C being considered random, and although Topics could logically
be considered random, Winer (1971) suggests that it must be considered
fixed in order that the assumptions involved in the G, L and T translation
be reasonable.

[Insert Table 2 about here] 

Although complex enough, Table 2 does not provide the coefficients by
which each of the sources of variance is multiplied in the expectancy
table. These are arrived at by multiplying together, for each term in the
EMS expression, the number of levels of each factor which does not appear in
the designation of the term. For example, for source 18, GK:CD,

\[
E(MS_{GK:CD}) = jst\,\sigma^2_{GK:CD} + jt\,\sigma^2_{KS:CGD} + s\,\sigma^2_{GJ:CKDT} + \sigma^2_{JS:CGKDT}.
\]
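
The coefficient rule lends itself to a short Python sketch. The level
counts in `LEVELS` are not quoted anywhere in the text; they are inferred
here from the VCE denominators of Figure 5 and from the statement that gjks
equals 80, and should be read as an assumption. The function name
`coefficient` is likewise hypothetical.

```python
from math import prod

# Assumed numbers of levels (inferred, not stated in the paper):
# 2 countries of student, 2 groups per country-duplication cell,
# 2 judges per cell, 2 countries of judge, 8 duplications,
# 10 students per group, 2 topics.
LEVELS = {"C": 2, "G": 2, "J": 2, "K": 2, "D": 8, "S": 10, "T": 2}


def coefficient(term: str) -> int:
    """Product of the levels of every factor absent from the term's name."""
    present = set(term.replace(":", ""))
    return prod(n for factor, n in LEVELS.items() if factor not in present)


# The four terms in the expected mean square for source 18, GK:CD.
for term in ["GK:CD", "KS:CGD", "GJ:CKDT", "JS:CGKDT"]:
    print(f"{coefficient(term):>3} x variance component of {term}")
```

With these assumed levels the four coefficients correspond to jst, jt, s
and 1, matching the expression above.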

Figure 5 gives appropriate calculations for both tests of significance
and variance component estimates. The tests of significance are not an
integral part of the approach taken in this study, but they can be easily
calculated on the way to the VCE's. For both F- and quasi-F tests, the
numerator becomes the additive component, and the denominator the subtractive
component of the VCE. The divisor consists of the product of the number of
levels of factors not included in the name of the source for which the
calculation is being done. For example, for source 14, CDT, the quasi-F
ratio is formed by:

\[
(MS_{CDT} + MS_{JS:CGKDT}) / (MS_{J:KDT} + MS_{ST:CGD})
\]

and the VCE by:

\[
(MS_{CDT} + MS_{JS:CGKDT} - MS_{J:KDT} - MS_{ST:CGD}) / gjks
\]

The value gjks equals 80.
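
As a minimal sketch of the two calculations for source 14, the following
Python fragment forms the quasi-F ratio and the VCE from a dictionary of
mean squares. The mean square values are invented purely for illustration
and are not results from this study; the function names `quasi_f` and `vce`
are hypothetical.

```python
def quasi_f(ms, additive, subtractive):
    """Quasi-F ratio: sum of additive mean squares over sum of subtractive."""
    return sum(ms[s] for s in additive) / sum(ms[s] for s in subtractive)


def vce(ms, additive, subtractive, divisor):
    """Variance component estimate: (additive - subtractive) / divisor."""
    numerator = sum(ms[s] for s in additive) - sum(ms[s] for s in subtractive)
    return numerator / divisor


# Hypothetical mean squares, invented for this example only.
ms = {"CDT": 6.0, "JS:CGKDT": 2.0, "J:KDT": 3.0, "ST:CGD": 1.0}

additive = ["CDT", "JS:CGKDT"]      # numerator terms for source 14
subtractive = ["J:KDT", "ST:CGD"]   # denominator terms for source 14

print("quasi-F:", quasi_f(ms, additive, subtractive))
print("VCE:", vce(ms, additive, subtractive, divisor=80))  # gjks = 80
```

The same two functions apply to any row of Figure 5 once its additive
terms, subtractive terms and divisor are supplied.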

[Insert Figure 5 about here] 
Still within the confines of Model 1, an alternative method of
analysis, involving a different treatment of the duplications, was considered,
and judged inferior. It is possible to ignore the duplications factor,
perform eight separate six-way analyses, and average the results. Such a
method should, because of the averaging, yield highly stable estimates of
the variance components. We report this analysis only for the light it
may shed on variance component estimates, and the inner workings of
complex analyses of variance.

This alternative approach is based on the additive components displayed
in Figure 6. The first insight, not major in the present context because of
the low priority we placed on tests of significance, is that all of the sources
in Figure 5 which are tested against their interaction with D have to be
approached through the use of a quasi-F test in the six-factor analysis.
The second insight involves the relationship among the VCE's as calculated
by the two methods. The eight extra sources in the seven-way analysis are
D and the interactions of D with combinations of the totally crossed factors,
C, K and T. (L must be ignored in this discussion, which is in terms of G
only.) The remaining sources give identical VCE's in both analyses, while
each source involving crossed factors in the six-way analysis gives a VCE
equal to the sum of the corresponding VCE from the seven-way analysis
plus its interaction with D. That is, for example, VCE (CK, six-way)
= VCE (CK, seven-way) + VCE (CKD, seven-way). The remaining VCE, D itself,
directly reflects the differences in duplications.

(Insert Figure 6 about here) 

It seems, therefore, that there is little advantage to be gained by the
averaging of the eight six-way analyses. The expected stability of such
averaging turns up in the seven-way design. We not only get a direct
estimate of stability in the D factor itself, but we also remove random error
from our estimates of C, K, T and their interactions. Failure to remove
random error from VCE's for fixed factors would, in turn, inflate
generalizability coefficients.

ANALYSIS (MODEL 2)

The relationship between Model 1 and Model 2 is best understood by
considering a set of four duplications, say 1 to 4. Sets of results are
selected from these four duplications in the following manner: T1L1 from D1;
T1L2 from D2; T2L1 from D3; and T2L2 from D4, producing the data set as
depicted in Figure 7. Judges and students have been given their sequence
numbers in the total design, rather than within duplication, to clarify
the rearrangement. Also, the reason for numbering out of sequence in Figure 1
should now be clear.

(Insert Figure 7 about here) 




In this new design, Limits and Topics are completely crossed, with
each of the LT combinations produced by independent sets of judges and
students. The proxy, G, is no longer needed.

This arrangement (called Replication 1) has used one half of the data
from one half of the students in Duplications 1 to 4. One half of the data
from the other students forms a second, independent set of data similar to
Figure 1 (Replication 2). Thus far, we have used one half of the data from
all students in Duplications 1 to 4. The other half of their data forms
two more replications, independent of each other, but not independent of
Replications 1 and 2. The non-independent partitioning will be referred to
as division into Experiments 1 and 2. Thus, Duplications 1 to 4 have yielded
Replications 1 and 2, independent of each other within Experiment 1, but
dependent on Replications 1 and 2 of Experiment 2, which are formed by the
other half of the data from the same students.

A corresponding process in Duplications 5 to 8 yields Replications 3 and
4 for each of Experiments 1 and 2. Figure 8 clarifies the realignment of
the data. The reader should note that this partitioning of the data is not
unique.

(Insert Figure 8 about here) 
The new design removes the confounding, allows complete crossing of L
and T, and simplifies analysis. Much of the preceding discussion on Model 1
is applicable to Model 2, so that detailed discussion of Model 1 is not
redundant. Figures 9 and 10, and Table 3, for Model 2, correspond to Figures 2
and 5, and Table 2 for Model 1. These give, respectively, additive components,
calculations for VCE's and F-tests, and EMS expressions.






[Insert Table 3 and Figures 9 and 10 about here] 

RESULTS 

Consistency of Judges - The judges were quite inconsistent in their
scoring, as would be expected based on the literature. Using the 13-point
scale for averaging, and then converting back to the nearest letter, the
average grade awarded by a judge over the 40 papers graded varied from a
'D' to a 'B'. In average grade awarded, the 64 judges broke down as
follows: D, 5; D+, 7; C-, 13; C, 18; C+, 12; B-, 6; B, 3. These results show
great differences in expectations of experienced, qualified teachers of
grade ten English.
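
The averaging step can be sketched as follows. The mapping of letters to
points is an assumption: F = 1 through A+ = 13, which is consistent with
the reported overall mean of 5.78 falling between 'C-' and 'C' but is not
spelled out in this section. The names `GRADES`, `POINTS` and
`average_grade` are hypothetical.

```python
# Assumed 13-point scale, F = 1 up to A+ = 13.
GRADES = ["F", "D-", "D", "D+", "C-", "C", "C+",
          "B-", "B", "B+", "A-", "A", "A+"]
POINTS = {grade: i for i, grade in enumerate(GRADES, start=1)}


def average_grade(letters):
    """Return (mean points, nearest letter) for a list of letter grades.

    Ties between two adjacent letters resolve to the lower one, which is
    an arbitrary choice made for this sketch.
    """
    mean = sum(POINTS[g] for g in letters) / len(letters)
    nearest = min(POINTS.values(), key=lambda p: abs(p - mean))
    return mean, GRADES[nearest - 1]


# Hypothetical grades awarded by one judge to three papers.
mean, letter = average_grade(["D", "C-", "C"])
print(round(mean, 2), letter)
```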

The above analysis tends to exaggerate the differences in grading
among teachers. Even though the sets of students were formed randomly,
the sets would be expected to differ in performance. Thus, not all of the
variation reported in the above paragraph is due to inconsistent grading
practices. However, sixteen sets of four judges graded the same forty
papers. Within these sets, any spread across judges is a direct reflection
of differing standards. In only one of the sixteen sets was the spread
across all four judges near one unit on the 13-point scale. Most sets of
judges had a spread of two, three, or four points, while in one set of
judges the spread was five points. Given that these grades are averaged
over 40 papers, the degree of inconsistency is remarkable.

Average grade awarded is only one possible way of considering consistency.
Looking at the 16 sets of four judges who scored identical papers,
we can produce six correlation coefficients for each set, for a total of
96. Of these, 21 are less than 0.50, 52 between 0.50 and 0.70, and 23
above 0.70. These figures give a much more optimistic picture of judge
consistency, which, in summary, seems to be more a problem of difference
in average grade than differences in rank-ordering.
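
The count of six coefficients per set follows from pairing four judges two
at a time. A small Python sketch, using invented scores rather than data
from the study, shows both the pairing and how two judges can differ in
average grade while agreeing perfectly in rank-ordering. The helper names
`pearson` and `judge_pair_correlations` are hypothetical.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def judge_pair_correlations(scores_by_judge):
    """Correlate every pair of judges who scored the same papers."""
    judges = sorted(scores_by_judge)
    return {(a, b): pearson(scores_by_judge[a], scores_by_judge[b])
            for a, b in combinations(judges, 2)}


# Invented scores on the 13-point scale for four judges and five papers.
# J2 is uniformly two points harsher than J1.
scores = {
    "J1": [5, 6, 7, 4, 8],
    "J2": [3, 4, 5, 2, 6],
    "J3": [6, 6, 8, 5, 9],
    "J4": [4, 7, 5, 5, 7],
}
pairs = judge_pair_correlations(scores)
print(len(pairs), "coefficients per set;", 16 * len(pairs), "over 16 sets")
print("J1 vs J2:", round(pairs[("J1", "J2")], 3))
```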

Summary Statistics - A summary of grades awarded, ignoring the data
partitions of Duplications, Replications, and Experiments, is presented in
Table 4. Column 1 gives a breakdown of the scores of the entire sample
(320 students X two topics X four judges/topic). The scores are positively
skewed, with a mean of 5.78, between 'C-' and 'C'. Considerably more low
grades have been awarded than high grades. There are more grades of 'F'
than 'A-', 'A' and 'A+' combined. As experience suggests, these results
confirm that it is difficult for a student to score high in English.

[Insert Table 4 about here] 
Columns 2 and 3 give the results broken down by time limits. Against
expectations, there is a small difference in grades in favour of the short
time limits. Since students were writing under the understanding that the
grades would "count", this cannot be explained by suggesting that students
would not bother with any effort at home merely to cooperate with a research
project. The difference in grades under the two conditions is quite small,
and probably of no practical significance. Such an unexpected finding will
be discussed below. It would be an interesting extension of this study to
see if lengthening the time limits (and the length of the writing sample)
might change this tentative conclusion. For example, would grades
differentiate between a one hour time limit and a one week time limit? Also,
this study does not look at the length of production under different time
limits.

Columns 4 and 5 give a breakdown for the two topics. There is a
difference in grades, in favour of the descriptive topic. Unlike the time
limit effect, inspection of the pattern of grades shows a trend in the
difference. The descriptive topic was awarded fewer grades in the range
D+ to B-, and more in the B to A range. The number of very low grades
was about equal in both cases.

Although "Country" has been so dubbed, we are considering students
from only one state and one province. This, of course, limits generalizability
of the findings. Columns 6 and 7 demonstrate that the Newfoundland
students scored considerably higher than the Iowa students. The difference
appears to be real, statistically significant, and large enough to be
educationally important. The cause, however, is open to question. Matching
of the two groups was done roughly, on the basis of school and community
size, and on the basis of standardized test scores available for earlier
grades in the same communities. The finding could be attributed to a poor
match across countries. Or, since the matching was on scores from earlier
years, the higher dropout rate in Newfoundland could lead to a better
student, on average, remaining in school there. As a closely related
alternative, the schools in Iowa are in most cases composite high schools.
Due to widespread population in Newfoundland, a system of regional
vocational schools draws vocational students away from the largely academic
high schools. Again, the population in the Newfoundland schools could, on
average, be more academic than in Iowa. Choosing among these explanations,
and the controversial and unlikely alternative that Newfoundland education
is simply better, is, without further investigation, merely speculation.

On average, Newfoundland students scored more than 0.50 points higher 
than Iowa students, on the 13-point scale. Inspection of the table shows 




that they were awarded far fewer failing grades, which was compensated 
for by more grades in the range C- to There were about equal numbers 
of good grades. 

Columns 8 and 9 report the grades awarded by judges from the two 
countries. As can be seen, Iowa judges, on average, gave scores a full 
point higher than Newfoundland judges. The awarding of far fewer failing 
grades by the Iowa judges is balanced by more very good grades. Numbers 
of grades in the average range are about equal. 

Individual variations among judges are very large. Four judges 
gave failing grades to 15 or more of the 40 papers they scored. All were 
from Newfoundland. Twenty judges gave no failing grades, and 17 of these 
were from Iowa. At the other end of the scale, considering A-, A, and A+ 
as one grade of 'A', only one judge, from Iowa, gave more than ten A's:
twelve. Ten judges gave no A grades, nine from Newfoundland.

This phenomenon reflects, we believe, not so much a difference in
expectations of quality of work, but in the language through which work is
judged. That is, Iowa judges tend to consider a "typical" or average paper
to be worth a higher grade on the F ... A+ scale. As long as the language
and expectations are understood, and internally consistent, no problems
arise. Problems do arise, however, when between country comparisons are
made.

It is important to realize that, for the values in Table 4, judges 
from both countries scored essays from both countries, but were unaware 
of country of origin of the papers. If the Iowa papers had been marked
only by the Iowa judges, scores would not be comparable with Newfoundland 
papers marked only by Newfoundland judges. This seems an important finding, 
and is worthy of further investigation. 




Analysis of Variance - Before abandoning Model 1, we argued for inclusion
of the Duplication effect in the ANOVA. The Duplications were independent,
as are the Replications of Model 2. The Experiments of Model 2 are not
independent, however, as they involve the same students' data, and thus
should not be treated in the same manner as Replications. Replications,
as a factor, has been included in the design of Model 2. In principle,
Experiments could be included as well, as a factor distinct from Replications,
but this would produce a summary table with more than 50 sources, and might
lead to difficulty in interpreting VCE's due to the non-independence problem.
Thus, Experiments 1 and 2 are reported separately and averaged for calculation
of generalizability coefficients.

Interest in Model 1 is confined largely to the ANOVA questions raised
in the preceding discussion, but results are included for the sake of
comparison. Results of the analysis of variance are reported in Tables
5, 6 and 7 for Model 1, Model 2 (E1), and Model 2 (E2) respectively.
The fact that sums of squares for some of the crossed effects, such as C or
K, do not sum in E1 and E2 from Model 2 to the value found in Model 1 can
be attributed to the unexamined Experiment effect, and its interaction
with C and K. The discussion parallels that involving the designs in
Figures 2 and 6, involving omission of the Duplication factor.

(Insert Tables 5, 6 and 7 about here) 

Significant effects were not consistent across the three analyses.
Using a conservative level of 0.01, only the student and judge effects were
significant in all three cases. Under Model 1, there were significant effects
as well for K, the country of judges, and the CJ and ST interactions. Both
Experiments of Model 2 showed significant effects for the JL interaction,
and Model 1 for the LRT interaction.

The most disturbing characteristic of these data is the lack of
consistency of results across E1 and E2, a random partitioning of the data.
For example, the top half of Table 8 shows a breakdown of means for the
JL interaction, which was significant at the 0.01 level in both Experiments.
Inspection of the cell means shows that, in fact, these two highly significant
interactions are in opposite directions, on average cancel each other,
and are nothing more than artifacts of this particular random splitting of
the data. This is even more disturbing when one considers that either half
of the data set is still quite large, larger than the data sets on which are
based many conclusions in the literature.

With a smaller data set, of course, differences of the size found here
might not be judged significantly different. The logical outcome of this line
of reasoning is that we should be reporting effect sizes rather than levels
of significance. Indeed, such a stance forms much of the basis for the
estimating of variance components.

This particular split of the data is one of many possibilities. A few
different partitionings of the data have been carried out in preliminary
analysis, and, in every case, "significant" interactions have been found in
one or the other half of the data. Only the student and judge effects have
been consistent throughout this data probing. While our results seem to
justify the general trend away from reporting of significant differences to
reporting of effect sizes and variance component estimates, as will be seen
below, little comfort can be found in the stability of the VCE's resulting
from E1 and E2.

(Insert Table 8 about here) 




The second half of Table 8 reports data on the interesting speculation
that markers from each country show preference either for or against student
writing from their own country. While the large C effect has been discussed
above, CK cell means might tend to show some preference one way or the other.
As can be seen, at least in this particular random division by E1 and E2, this
has not occurred.

Table 9 reports VCE's for Model 1, E1 and E2 of Model 2, and Model 2 in
total. As much as possible, sources from Model 1 and 2 have been paired.
Those in brackets for Model 1 are confounded, and as has been discussed,
cannot be estimated independently. Similarly, those same sources, without
brackets, contain confounding. These estimates should be viewed from this
perspective. Theoretically, VCE's for totally crossed factors, such as C,
differ in Model 1 from Model 2 by an amount equal to the interaction of that
factor with the uncalculated Experiment effect, in this case, CE. Further
discussion and generalizability coefficients will use Model 2 data only.

The variability in VCE's across the two Experiments of Model 2 parallels
that in the ANOVA summary tables. There are, as expected, consistent estimates
for J, S, and the JS interaction, although the VCE for the judge effect is
considerably smaller than in Model 1. In each Experiment, there are large
effects which are not duplicated in the other experiment, notably K, KLR,
LRT, LR and LT (these latter two being very large negative in E1). It should
be noted that interdependence of E1 and E2 confounds these between-Model
comparisons.

While the number of negative VCE's suggests that the model may be
inappropriate, and might be improved by elimination of many of the higher order
interactions, this route was avoided in order to facilitate comparisons of
the two models. In general, negative VCE's recur, and in those cases where
a negative from one model is paired with a positive from the other, both are
small. There are exceptions, as noted. While the two values for the student
and judge effects vary enough that generalizability coefficients calculated
on the basis of each model differ substantially, they at least are of the
same order of magnitude. Generalizability coefficients will be calculated
on the basis of each experiment of Model 2. Negative VCE's will be replaced
by zero.

This study was reasonably well designed for estimating variance
components of a great many factors and for gathering information on the
stability of these estimates. However, in retrospect it appears as if the
design is not amenable to the production of many generalizability coefficients
which will in turn shed light on issues of essay grading. Model 1 nested
judges within topics, and Model 2 nested both judges and students within
topics. Thus, no comments can be made about how generalizability coefficients
would vary with manipulation of number of samples and judges. A few
calculations, nevertheless, can be made. Taking S:CLRT as the facet of
differentiation, with K fixed, only six of the 38 sources are not included in
the true score, or numerator of the generalizability coefficient (Rentz, 1980).
Of these six, four are interactions of random factor J:KRT with components of
true score, and the remaining two are K and J:KRT itself. This discussion
assumes division by appropriate divisors in each case. The only options seem
to be whether to include in the denominator J:KRT or not. If we ignore J:KRT
(and K), presumably we have an estimate of generalizability across students
ignoring judge variability, that is, for the score given by an average judge,
or the sum score of four judges. These values, for E1, E2 and the average of
Model 2, are 0.9S, 0.93 and 0.92. On the other hand, if we include J:KRT in the
denominator, this gives a coefficient generalizing over students for the
grade assigned by a random judge from a particular country. In this case,
the values just quoted drop to 0.88, 0.86 and 0.87 respectively. This
interpretation, representing a unique blend of Rentz, Cronbach et al. and
Linhart, is open to discussion.
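
The two versions of the coefficient can be illustrated with a minimal
sketch. It assumes the usual ratio form, true-score variance over
true-score plus error variance, with every VCE already divided by its
appropriate divisor and negative estimates replaced by zero. All numerical
values below are invented for illustration and are not the estimates of
Table 9; the function name `generalizability` is hypothetical.

```python
def generalizability(true_vces, error_vces):
    """Ratio of true-score variance to true-score plus error variance.

    Negative variance component estimates are replaced by zero before
    summing, following the convention stated in the text.
    """
    true = sum(max(v, 0.0) for v in true_vces.values())
    error = sum(max(v, 0.0) for v in error_vces.values())
    return true / (true + error)


# Invented values, assumed already divided by their divisors.
true_vces = {"S:CLRT": 4.0, "C": 0.5, "LT": -0.3}
judge_interactions = {"CJ:KRT": 0.1, "JL:KRT": 0.1,
                      "CJL:KRT": -0.05, "JS:CKLRT": 0.3}
judge_main = {"J:KRT": 1.0}

# Ignoring J:KRT: score given by an "average" judge.
print(generalizability(true_vces, judge_interactions))

# Including J:KRT: grade assigned by a random judge.
print(generalizability(true_vces, {**judge_interactions, **judge_main}))
```

Including the judge main effect can only enlarge the denominator, which is
why the second coefficient is never larger than the first.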

DISCUSSION AND CONCLUSIONS

We made no attempt to read the literature in comparative education.
Thus, we do not know the importance of the two country effects. Certainly,
the country of student effect would have to be investigated thoroughly, in
an investigation specifically designed to do so, before any conclusions could
be drawn. On the basis of this study, the result obtained must be attributed
to sample mismatch between the two countries on the basis of selective
attrition, differences in standardized test validity, or some such. The
country of judge phenomenon is probably more firmly supported by our data,
and as such is worthy of further investigation.

Student and judge variability was, as fully expected, large. Our
generalizability coefficients, interpreted as four judges scoring one writing
sample, are larger than earlier reported. Given the nesting of factors in
this design, we can put little faith in direct comparisons of our coefficients
with those from the literature. Our design was unable to address the usual
question of studies such as this, the relative efficiency of increasing
writing samples or judges in estimating writing ability.

The time limit effect was one of the major objects of investigation.
While this was initially one of the goals, as analysis proceeded, it diminished
in importance. The time limit results are unexpected, and as such, we
doubt their validity. As one reviewer of the study suggested, our treatment
may not have been strong enough. Students may be so out of practice at
doing homework that they would revise a paragraph at home only if it clearly
was not "finished" (perhaps, as a cynic suggested, as indicated by the absence
of a period after the last word). Further investigation of the time limit
phenomenon should utilize a greater distinction in conditions, perhaps putting
the longer, untimed conditions first in order to remove the perception that
what is required in the second situation is merely a minor modification of
an in-class, timed work.

The ANOVA situation encountered interfered with substantive issues,
but was of interest itself. The solution proposed, Model 2, solved the
confounding problems, with some sacrifice, notably loss of within student-
across topic information. Also, by forcing the nesting of students within
topics, it complicated already complex generalizability coefficient calculations.

We consider the most important outcomes of this study to be related to
the work of Smith (1978). Smith has done substantial work on the stability
of variance component estimates, and reports two general conclusions: first,
that stability decreases as the complexity of the design and expected mean
square expressions increases; second, stability increases as the number of
levels of each source increases. By Smith's criteria, this study fails on
both counts. More complex designs than this are few and far between, and
most of our factors were represented by only two levels. Had we been aware
of Smith's work earlier, we would have proceeded differently. However, all
is not lost. As we mentioned in passing, our splitting of the data into
E1 and E2 was done in one of many equally acceptable ways. Each split will
yield two sets of variance component estimates, producing a source of data
for a study of the stability of the VCE's under the circumstances of this
design. Since such sets of estimates would not be independent, the product
would not be as useful a tool as a Monte Carlo study, but it would have use
as a generator of a distribution of VCE's around a "true" value.

Having found a way to circumvent the confounding, we are confident
that other, yet to be discovered methods of cutting our data will yield more
fruitful results. While flaws in this study require that further data be
collected with different designs to answer the questions posed, the
potential of the present data set clearly has not yet been exhausted.



Figure 1

One Duplication of the Design, Model 1
Students are numbered out of sequence to assist in conversion to Model 2.



Additive Components, Model 1
(Without "Limits")

Source          degrees of freedom

C                       1
K                       1
D                       7
T                       1
CK                      1
CD                      7
CT                      1
KD                      7
KT                      1
DT                      7
G:CD                   16
CKD                     7
CKT                     1
CDT                     7
KDT                     7
S:CGD                 288
J:KDT                  32
GK:CD                  16
GT:CD                  16
CKDT                    7
CJ:KDT                 32
KS:CGD                288
ST:CGD                288
GKT:CD                 16
GJ:CKDT                64
KST:CGD               288
JS:CGKDT             1152

TOTAL                2559

Figure 2
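
The degrees of freedom in Figure 2 can be checked with a short Python
sketch. It applies the standard rule for nested designs: for a source
written A:B, multiply (levels - 1) over the factors in A and the full
number of levels over the factors in B. The level counts are inferred from
the table itself rather than quoted from the text, and the names `LEVELS`,
`SOURCES` and `degrees_of_freedom` are hypothetical.

```python
from math import prod

# Assumed numbers of levels, inferred from the degrees of freedom above.
LEVELS = {"C": 2, "G": 2, "J": 2, "K": 2, "D": 8, "S": 10, "T": 2}

SOURCES = [
    "C", "K", "D", "T", "CK", "CD", "CT", "KD", "KT", "DT",
    "G:CD", "CKD", "CKT", "CDT", "KDT", "S:CGD", "J:KDT",
    "GK:CD", "GT:CD", "CKDT", "CJ:KDT", "KS:CGD", "ST:CGD",
    "GKT:CD", "GJ:CKDT", "KST:CGD", "JS:CGKDT",
]


def degrees_of_freedom(source: str) -> int:
    """df for a source 'A:B': product of (levels - 1) over A times
    product of levels over the nesting factors B."""
    crossed, _, nest = source.partition(":")
    return (prod(LEVELS[f] - 1 for f in crossed)
            * prod(LEVELS[f] for f in nest))


table = {s: degrees_of_freedom(s) for s in SOURCES}
total = sum(table.values())
print(total)  # one less than 320 students x 2 topics x 4 judges
```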





This diagram ignores factors C, J, K, D, S

Relationship Between "Group", "Limit" and "Topic" Effects

Figure 3



Relationship between "Groups"
and "Limits", Model 1

Source     Confounded             Translation   Confounded
           Sources                              (assumed = 0)

G:CD       G, GD, CG, CGD         LT            LDT, CLT, CLDT

GK:CD      GK, CGK, GKD, CGKD     KLT           CKLT, KLDT, CKLDT

GT:CD      GT, CGT, GDT, CGDT     L             CL, DL, CLD

GKT:CD     GKT, CGKT, GKDT,       KL            CKL, KLD, CKLD
           CGKDT

GJ:CKDT    GJ, CGJ, GJK, GJD,     JL:KDT        JLT, CJLT, JKLT, JLDT,
           GJT, CGJK, CGJD,                     CJKLT, CJLDT, CJL, JKLDT,
           CGJT, GJKD, GJKT,                    JKL, JLD, CJKLDT, CJKL,
           GJDT, CGJKD, CGJKT,                  CJLD, JKLD, CJKLD
           CGJDT, GJKDT, CGJKDT

Figure 4



Tests of Significance and VCE Denominators
Model 1

     Source      MS's, Numerator        MS's, Denominator       VCE Denominator

 1   C           C                      CD                      1280
 2   K           K                      KD                      1280
 3   D           D, JS:CGKDT            S:CGD, J:KDT             320
 4   T           T                      DT                      1280
 5   CK          CK                     CKD                      640
 6   CD          CD, JS:CGKDT           S:CGD, CJ:KDT            160
 7   CT          CT                     CDT                      640
 8   KD          KD, JS:CGKDT           J:KDT, KS:CGD            160
 9   KT          KT                     KDT                      640
10   DT          DT, JS:CGKDT           J:KDT, ST:CGD            160
11   G:CD        G:CD, JS:CGKDT         S:CGD, GJ:CKDT            80
12   CKD         CKD, JS:CGKDT          CJ:KDT, KS:CGD            80
13   CKT         CKT                    CKDT                     320
14   CDT         CDT, JS:CGKDT          CJ:KDT, ST:CGD            80
15   KDT         KDT, JS:CGKDT          J:KDT, KST:CGD            80
16   S:CGD       S:CGD                  JS:CGKDT                   8
17   J:KDT       J:KDT                  JS:CGKDT                  40
18   GK:CD       GK:CD, JS:CGKDT        KS:CGD, GJ:CKDT           40
19   GT:CD       GT:CD, JS:CGKDT        ST:CGD, GJ:CKDT           40
20   CKDT        CKDT, JS:CGKDT         CJ:KDT, KST:CGD           40
21   CJ:KDT      CJ:KDT                 JS:CGKDT                  20
22   KS:CGD      KS:CGD                 JS:CGKDT                   4
23   ST:CGD      ST:CGD                 JS:CGKDT                   4
24   GKT:CD      GKT:CD, JS:CGKDT       GJ:CKDT, KST:CGD          20
25   GJ:CKDT     GJ:CKDT                JS:CGKDT                  10
26   KST:CGD     KST:CGD                JS:CGKDT                   2
27   JS:CGKDT    JS:CGKDT                                          1



Figure 5 
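The VCE denominators in Figure 5 appear to be the number of ratings averaged over at each level of a source, that is, the total number of ratings divided by the product of the levels of every factor named in the source. A minimal sketch of that check follows. The level counts are assumptions inferred from the table, and `vce_denominator` is a hypothetical helper name.

```python
from math import prod

# Assumed numbers of levels per factor (inferred, not stated in this section).
LEVELS = {"C": 2, "G": 2, "K": 2, "D": 8, "T": 2, "S": 10, "J": 2}
N_RATINGS = prod(LEVELS.values())  # 2560 ratings in the full Model 1 layout

def vce_denominator(term, levels=LEVELS, n=N_RATINGS):
    """Total ratings divided by the product of the levels of the term's factors."""
    factors = term.replace(":", "")
    return n // prod(levels[f] for f in factors)

# VCE denominators as read from Figure 5.
FIGURE_5 = {
    "C": 1280, "K": 1280, "D": 320, "T": 1280, "CK": 640, "CD": 160,
    "CT": 640, "KD": 160, "KT": 640, "DT": 160, "G:CD": 80, "CKD": 80,
    "CKT": 320, "CDT": 80, "KDT": 80, "S:CGD": 8, "J:KDT": 40,
    "GK:CD": 40, "GT:CD": 40, "CKDT": 40, "CJ:KDT": 20, "KS:CGD": 4,
    "ST:CGD": 4, "GKT:CD": 20, "GJ:CKDT": 10, "KST:CGD": 2, "JS:CGKDT": 1,
}

# Collect any source where the rule and the figure disagree.
mismatches = {
    term: (vce_denominator(term), tabled)
    for term, tabled in FIGURE_5.items()
    if vce_denominator(term) != tabled
}
print(mismatches or "all 27 VCE denominators reproduced")
```

With these assumed level counts the rule reproduces every entry in the column, which supports reading the column as a divisor applied to a difference of mean squares.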




Alternative 6-Way Analysis

Source                d.f.

K
T
CK
CT
KT
G:C (LT)                 2
CKT                      1
S:CG                    36
J:KT                     4
GK:C (KLT)               2
GT:C (L)                 2
CJ:KT                    4
KS:CG                   36
ST:CG                   36
GKT:C (KL)               2
GJ:CKT (JL:KT)           8
KST:CG                  36
JS:CGKT                144

TOTAL                  319

Figure 6



[Figure 7 diagram: layout of judges and student blocks (1-10, 11-20, 61-70, 71-80, 101-110, 111-120, 121-130, 131-140) not recoverable from the scan.]

Figure 7

One Replication of one Experiment (E1), based
on one quarter of Duplications 1 to 4, Model 2



[Figure 8 grid: columns headed D, T, L and R; cell entries not recoverable from the scan.]

Figure 8
Conversion of Model 1 to Model 2



Additive Components
Model 2

Source         df

C
K
L
R
T
CK
CL
CR
CT
KL
KR
KT
LR
LT
RT
CKL
CKR
CKT
CLR
CLT
CRT
KLR
KLT
KRT
LRT             3
J:KRT          16
CKLR            3
CKLT            1
CKRT            3
CLRT            3
KLRT            3
S:CLRT        288
CJ:KRT         16
JL:KRT         16
CKLRT           3
KS:CLRT       288
CJL:KRT        16
JS:CKLRT      576



Figure 9 



Tests of Significance and VCE Denominators
Model 2

     Source      MS's, Numerator        MS's, Denominator       VCE Denominator

 1   C           C, CRT                 CR, CT                   640
 2   K           K, KRT                 KR, KT                   640
 3   L           L, LRT                 LR, LT                   640
 4   R           R                      RT                       320
 5   T           T                      RT                       640
 6   CK          CK, CKRT               CKR, CKT                 320
 7   CL          CL, CLRT               CLR, CLT                 320
 8   CR          CR                     CRT                      160
 9   CT          CT                     CRT                      160
10   KL          KL, KLRT               KLR, KLT                 320
11   KR          KR                     KRT                      160
12   KT          KT                     KRT                      320
13   LR          LR                     LRT                      160
14   LT          LT                     LRT                      320
15   RT          RT, JS:CKLRT           J:KRT, S:CLRT            160
16   CKL         CKL, CKLRT             CKLR, CKLT               160
17   CKR         CKR                    CKRT                      80
18   CKT         CKT                    CKRT                     160
19   CLR         CLR                    CLRT                      80
20   CLT         CLT                    CLRT                     160
21   CRT         CRT, JS:CKLRT          S:CLRT, CJ:KRT            80
22   KLR         KLR                    KLRT                      80
23   KLT         KLT                    KLRT                     160
24   KRT         KRT, JS:CKLRT          J:KRT, KS:CLRT            80
25   LRT         LRT, JS:CKLRT          S:CLRT, JL:KRT            80
26   J:KRT       J:KRT                  JS:CKLRT                  40
27   CKLR        CKLR                   CKLRT                     40
28   CKLT        CKLT                   CKLRT                     80
29   CKRT        CKRT, JS:CKLRT         CJ:KRT, KS:CLRT           40
30   CLRT        CLRT, JS:CKLRT         S:CLRT, CJL:KRT           40
31   KLRT        KLRT, JS:CKLRT         JL:KRT, KS:CLRT           40
32   S:CLRT      S:CLRT                 JS:CKLRT                   4
33   CJ:KRT      CJ:KRT                 JS:CKLRT                  20
34   JL:KRT      JL:KRT                 JS:CKLRT                  20
35   CKLRT       CKLRT, JS:CKLRT        KS:CLRT, CJL:KRT          20
36   KS:CLRT     KS:CLRT                JS:CKLRT                   2
37   CJL:KRT     CJL:KRT                JS:CKLRT                  10
38   JS:CKLRT    JS:CKLRT                                          1

Figure 10



Table 1

Investigation of Translation Assumptions, Model 1

          Group Version                       Limit Version

Source       SS     df     MS        Source       SS     df     MS

G:CD      336.4     16   21.0        LT          0.0      1    0.0
                                     LDT       210.7      7   30.1
                                     CLT        21.8      1   21.8
                                     CLDT      103.9      7   14.8
                                     Total     336.4

GT:CD     195.9     16   12.2        L           9.5      1    9.5
                                     CL         10.0      1   10.0
                                     LD         64.0      7    9.1
                                     CLD       112.4      7   16.1
                                     Total     195.9
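Table 1 rests on a simple partition: the "Limit version" sums of squares should add up to the corresponding "Group version" sum of squares, and each mean square is its sum of squares divided by its df. The sketch below checks this with the values legible in the scan. Two points are assumptions: the df of 1, 7, 1, 7 for the LT block (taken by analogy with the L block), and the LD sum of squares, which is garbled in the scan and is recovered here as a remainder.

```python
from math import isclose

G_CD_SS, GT_CD_SS = 336.4, 195.9  # "Group version" sums of squares

# "Limit version" terms as (SS, df); the df for this block are assumed.
lt_block = {"LT": (0.0, 1), "LDT": (210.7, 7), "CLT": (21.8, 1), "CLDT": (103.9, 7)}
# Legible entries of the second block; LD is left out on purpose.
l_block_known = {"L": (9.5, 1), "CL": (10.0, 1), "CLD": (112.4, 7)}

# The four LT-block terms should partition the G:CD sum of squares.
lt_total = sum(ss for ss, _ in lt_block.values())

# Recover the garbled LD sum of squares as the remainder of GT:CD.
ld_ss = GT_CD_SS - sum(ss for ss, _ in l_block_known.values())
ld_ms = ld_ss / 7

mean_squares = {term: ss / df for term, (ss, df) in lt_block.items()}

print(round(lt_total, 1), round(ld_ss, 1), round(ld_ms, 1))
print({term: round(ms, 1) for term, ms in mean_squares.items()})
```

The recovered LD mean square agrees with the 9.1 that is legible in the MS column, which is the basis for reading the garbled SS entry as 64.0.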






Table 2

Expected Mean Squares, Model 1

[Matrix of X marks: rows are the 27 sources listed below, columns are source
numbers 1 to 27. The column positions of the marks are not recoverable from
the scan.]

 1 C          8 KT        15 KDT       22 KS:CGD
 2 K          9 KD        16 S:CGD     23 ST:CGD
 3 T         10 DT        17 J:KDT     24 GKT:CD
 4 D         11 G:CD      18 GK:CD     25 GJ:CKDT
 5 CK        12 CKT       19 GT:CD     26 KST:CGD
 6 CT        13 CKD       20 CKDT      27 JS:CGKDT
 7 CD        14 CDT       21 CJ:KDT

S, and D random, C, G, K, T fixed



Table 3

Expected Mean Squares, Model 2

[Matrix of X marks: rows are the 38 sources listed below, columns are source
numbers 1 to 38. The column positions of the marks are not recoverable from
the scan, and the footnote is truncated after "J, S, a".]

 1 C         11 KR        21 CRT       31 KLRT
 2 K         12 KT        22 KLR       32 S:CLRT
 3 L         13 LR        23 KLT       33 CJ:KRT
 4 R         14 LT        24 KRT       34 JL:KRT
 5 T         15 RT        25 LRT       35 CKLRT
 6 CK        16 CKL       26 J:KRT     36 KS:CLRT
 7 CL        17 CKR       27 CKLR      37 CJL:KRT
 8 CR        18 CKT       28 CKLT      38 JS:CKLRT
 9 CT        19 CLR       29 CKRT
10 KL        20 CLT       30 CLRT



TABLE 4

Summary of Scores Awarded
(% of Total Sample)

                   SHORT    LONG
                   TIME     TIME                         NFLD.      IOWA       NFLD.    IOWA
SCORE    TOTAL     LIMIT    LIMIT    TOPIC 1   TOPIC 2   STUDENTS   STUDENTS   JUDGES   JUDGES

N                  1280     1280     1280      1280      1280       1280       1280     1280

1-F      10.2       9.5     10.9      9.2      11.2       6.6       13.8       16.3      4.1
2-D-      6.6       6.3      7.0      7.6       5.6       6.0        7.2        8.1      5.1
3-D      10.3      10.5     10.1     10.3      10.3      10.5       10.1        9.3     11.5
4-D+      8.2       9.1      7.4      7.7       8.8       8.3        8.2        9.1      7.3
5-C-     13.1      12.3     13.8     12.7      13.4      13.6       12.6       12.2     14.0
6-C      11.7      12.5     10.9     10.9      12.4      11.7       11.6       10.9     12.5
7-C+      9.5       8.5     11.2      9.5       9.4      10.9        8.0        8.5     10.4
8-B-      9.6      10.2      9.1      9.3      10.0      10.5        8.8        8.6     10.7
9-B       7.6       7.2      8.0      8.4       6.9       8.4        6.9        6.3      8.9
10-B+     5.1       5.0      5.2      5.9       4.3       5.5        4.7        5.1      5.2
11-A-     4.3       4.8      3.8      4.4       4.2       4.2        4.4        2.7      5.9
12-A      3.4       3.7      3.1      3.8       3.0       3.5        3.3        2.6      4.2
13-A+     0.4       0.4      0.3      0.3       0.4       0.2        0.5        0.3      0.4

MEAN      5.78      5.85     5.72     5.88      5.69      6.04       5.53       5.24
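The bottom row of Table 4 appears to be the mean score on the 13-point scale (1 = F to 13 = A+), computed from the percentage distribution above it. The sketch below checks that reading for the first column of percentages, which is taken here to cover the whole sample. The variable names are illustrative.

```python
# Percentage of all ratings at each scale point (first data column of Table 4).
percent_total = {
    1: 10.2, 2: 6.6, 3: 10.3, 4: 8.2, 5: 13.1, 6: 11.7, 7: 9.5,
    8: 9.6, 9: 7.6, 10: 5.1, 11: 4.3, 12: 3.4, 13: 0.4,
}

# The percentages should account for the whole sample.
total_percent = sum(percent_total.values())

# Mean score = sum of (scale point x percentage) / total percentage.
mean_score = sum(score * pct for score, pct in percent_total.items()) / total_percent

print(f"percentages sum to {total_percent:.1f}")
print(f"mean score = {mean_score:.3f}")
```

The result is within 0.01 of the 5.78 printed under the first column, consistent with that row being a mean on the 1 to 13 scale.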






Table 5
Summary of Anova, Model 1

     Source              Mean Square    F-ratio    Quasi-F-ratio    df (test)

 1   C                     171.1891       3.15                      1,7
 2   K                   [illegible]     13.06*                     1,7
 3   D                   [illegible]                   1.57         7,75
 4   T                   [illegible]      0.29                      1,7
 5   CK                  [illegible]      2.18                      1,7
 6   CD                  [illegible]                   1.48         8,305
 7   CT                      9.7516       2.53                      1,7
 8   KD                     59.4480                    1.01         8,36
 9   KT                    159.0016       1.76                      1,7
10   DT                     82.8266                    1.28         7,42
11   G:CD (LT)              21.0227                    0.66         21,337
12   CKD                     4.9337                    0.79         17,74
13   CKT                     9.2641       0.97                      1,7
14   CDT                     3.8516                    0.46         21,142
15   KDT                    90.1444                    1.52         7,35
16   S:CGD                  32.3617      11.26*                     288,1152
17   J:KDT                  58.4836      20.34*                     32,1152
18   GK:CD (KLT)             4.4555                    1.04         43,207
19   GT:CD (L)              12.2461                    1.28         24,321
20   CKDT                    9.4980                    1.35         12,64
21   CJ:KDT                  6.4180       2.23*                     32,1152
22   KS:CGD                  3.4841       1.21                      288,1152
23   ST:CGD                  8.2749       2.88*                     288,1152
24   GKT:CD (KL)             4.3289                    1.14         44,178
25   GJ:CKDT (JL:KDT)        3.5617       1.24                      64,1152
26   KST:CGD                 2.7716       0.96                      288,1152
27   JS:CGKDT                2.8752

* significant at the 0.01 level
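Where a source has no single error term, the tests use a quasi-F ratio: a sum of mean squares in the numerator over a sum of mean squares in the denominator, as laid out in Figure 5. The sketch below works through source 19, GT:CD(L). The degrees of freedom are computed with the Satterthwaite approximation, which is an assumption: the section does not name the method, although it reproduces the tabled 24, 321. The function name `quasi_f` is illustrative.

```python
def quasi_f(numerator, denominator):
    """Quasi-F ratio from lists of (mean_square, df) pairs.

    Returns the ratio plus Satterthwaite-approximated numerator and
    denominator degrees of freedom.
    """
    def combine(parts):
        total = sum(ms for ms, _ in parts)
        approx_df = total ** 2 / sum(ms ** 2 / df for ms, df in parts)
        return total, approx_df

    num_ms, num_df = combine(numerator)
    den_ms, den_df = combine(denominator)
    return num_ms / den_ms, num_df, den_df

# Source 19, GT:CD(L): mean squares from Table 5, df from Figure 2.
f_ratio, df_num, df_den = quasi_f(
    numerator=[(12.2461, 16), (2.8752, 1152)],   # GT:CD, JS:CGKDT
    denominator=[(8.2749, 288), (3.5617, 64)],   # ST:CGD, GJ:CKDT
)
print(f"{f_ratio:.2f}  {df_num:.0f},{df_den:.0f}")
```

Because the residual mean square has 1152 df, it contributes almost nothing to the approximated numerator df, which is why a 16-df effect ends up tested on about 24 df.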






TABLE 6

SUMMARY OF ANOVA, MODEL 2, E1

     Source      Mean Square    F-ratio    Quasi-F-ratio    df (test)

 1   C              63.457                     5.21         2,4
 2   K             192.976                     1.26         1,2
 3   L               1.313                     9.48         3,3
 4   R               9.149        0.38                      3,3
 5   T               1.582        0.07                      1,3
 6   CK             12.601                     2.40         1,3
 7   CL             14.238                     0.78         4,4
 8   CR             15.978        0.41                      3,3
 9   CT              3.720        0.10                      1,3
10   KL              6.757                     0.54         4,3
11   KR             52.942        1.61                      3,3
12   KT            125.626        3.83                      1,3
13   LR             34.047        0.10                      3,3
14   LT              0.488        0.00                      1,3
15   RT             24.145                     0.56         4,43
16   CKL             0.132                     0.43         3,4
17   CKR             5.472        5.91                      3,3
18   CKT             0.176        0.19                      1,3
19   CLR            34.884        1.59                      3,3
20   CLT            11.438        0.52                      1,3
21   CRT            39.103                     1.77         4,225
22   KLR            28.640        2.91                      3,3
23   KLT             2.195        0.22                      1,3
24   KRT            32.800                     1.10         4,20
25   LRT           326.092                     6.38*        3,40
26   J:KRT          29.702        9.02*                     16,576
27   CKLR            7.170        1.46                      3,3
28   CKLT            4.632        0.94                      1,3
29   CKRT            0.926                     0.56         58,46
30   CLRT           21.947                     1.00         4,191
31   KLRT            9.832                     0.37         5,19
32   S:CLRT         19.489        5.92*                     288,576
33   CJ:KRT          4.409        1.34                      16,576
34   JL:KRT         32.121        9.75*                     16,576
35   CKLRT           4.915                     0.93         8,38
36   KS:CLRT         3.137        0.95                      288,576
37   CJL:KRT         5.652        1.72                      16,576
38   JS:CKLRT        3.294

* significant at the 0.01 level






TABLE 7

     Source      Mean Squares   F-ratio    Quasi-F-ratio    df (test)

 1   C             111.038                     1.79         1,2
 2   K             651.226                     5.61         1,4
 3   L              10.332                     1.98         4,3
 4   R              23.965        0.62                      3,3
 5   T              31.563        0.82                      1,3
 6   CK              1.188                     0.11         4,3
 7   CL              0.488                     1.05         3,4
 8   CR             22.688       12.19                      3,3
 9   CT             40.257       21.63                      1,3
10   KL              3.301                     0.18         4,3
11   KR             73.597        8.87                      3,3
12   KT             43.882        5.29                      1,3
13   LR             53.353        0.55                      3,3
14   LT              0.488        0.01                      1,3
15   RT             38.659                     0.91         3,55
16   CKL             0.063                     0.57         3,3
17   CKR            12.259        6.71                      3,3
18   CKT            15.095        8.26                      1,3
19   CLR            23.051        0.67                      3,3
20   CLT            10.332        0.30                      1,3
21   CRT             1.861                     0.61         16,220
22   KLR           121.076        6.34                      3,3
23   KLT             2.720        0.14                      1,3
24   KRT             8.300                     0.40         3,37
25   LRT            96.238                     1.62         3,37
26   J:KRT          23.868        9.72*                     16,576
27   CKLR            9.455        1.71                      3,3
28   CKLT            0.282        0.05                      1,3
29   CKRT            1.828                     0.53         16,41
30   CLRT           34.557                     1.50         3,263
31   KLRT           19.099                     0.50         4,19
32   S:CLRT         21.147        8.61*                     288,576
33   CJ:KRT          4.968        2.02                      16,576
34   JL:KRT         39.865       16.22                      16,576
35   CKLRT           5.528                     1.21
36   KS:CLRT         3.118        1.27                      288,576
37   CJL:KRT         3.465        1.41                      16,576
38   JS:CKLRT        2.456

* significant at the 0.01 level






TABLE 8 

SELECTED MEANS FROM E1 AND E2




TABLE 9

SUMMARY OF VARIANCE COMPONENT ESTIMATES

     Sources      Sources
     (Model 2)    (Model 1)      Model 1    Model 2 (E1)   Model 2 (E2)   Model 2p

 1   C            C               0.0912       0.1295         0.0781        0.1038
 2   K            K               0.5603       0.0738         0.8469        0.4604
 3   L            GT:CD           0.0821       0.4576         0.0824        0.2700
 4   R            D               0.1620      -0.0469        -0.0459       -0.0464
 5   T            T              -0.0462      -0.0353        -0.0111       -0.0232
 6   CK           CK              0.0091       0.0246        -0.0761       -0.0258
 7   CL           (GT:CD)                     -0.0317         0.0052       -0.0133
 8   CR           CD              0.1159      -0.1445         0.1302       -0.0072
 9   CT           CT              0.0092      -0.2211         0.2400        0.0095
10   KL           GKT:CD          0.0435      -0.0445        -0.3169       -0.1807
11   KR           KD              0.0022       0.1259         0.4081        0.2670
12   KT           KT              0.1076       0.2901         0.1112        0.2007
13   LR           (GT:CD)                     -1.8253        -0.2680       -1.0467
14   LT           G:CD           -0.1503      -1.0175        -0.2992       -0.6584
15   RT           DT              0.1184      -0.1360        -0.0244       -0.0802
16   CKL          (GKT:CD)                    -0.0422        -0.0259       -0.0341
17   CKR          CKD            -0.0262       0.0568         0.1304        0.0936
18   CKT          CKT            -0.0007      -0.0047         0.0829        0.0391
19   CLR          (GT:CD)                      0.1617        -0.1438        0.0090
20   CLT          (GT:CD)                     -0.0657        -0.1514       -0.1085
21   CRT          CDT            -0.0996       0.2312        -0.2725       -0.0207
22   KLR          (GKT:CD)                     0.2351         1.2747        0.7549
23   KLT          GK:CD           0.0071      -0.0477        -0.1024       -0.0750
24   KRT          KDT             0.3971       0.0407        -0.2029       -0.0811
25   LRT          (G:CD)                       3.4722         0.4710        1.9716
26   J:KRT        J:KDT           1.3902       0.6602         0.5352        0.5977
27   CKLR         (GKT:CD)                     0.0564         0.0982        0.0773
28   CKLT         (GK:CD)                     -0.0035        -0.0656       -0.0346
29   CKRT         CKDT            0.0796      -0.0832        -0.0950       -0.0891
30   CLRT         (G:CD)                       0.0025         0.3100        0.1563
31   KLRT         (GK:CD)                     -0.5533        -0.5357       -0.5445
32   S:CLRT       S:CGD           3.6858       4.0488         4.6727        4.3608
33   CJ:KRT       CJ:KDT          0.1771       0.0558         0.1256        0.0907
34   JL:KRT       GJ:CKDT         0.0687       1.4414         1.8704        1.6559
35   CKLRT        (GK:CD)                     -0.0290         0.0701        0.0206
36   KS:CLRT      KS:CGD          0.1522      -0.0785         0.3309        0.1262
37   CJL:KRT      (GJ:CKDT)                    0.2358         0.1009        0.1684
38   JS:CKLRT     JS:CGKDT        2.8752       3.2940         2.4563        2.8752
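The estimates in Table 9 can be traced back to the mean squares: each variance component appears to be the numerator mean squares minus the denominator mean squares from Figure 10, divided by the VCE denominator, and the "Model 2p" column appears to be the simple average of the E1 and E2 estimates. The sketch below checks both readings for three entries. The helper name `variance_component` is hypothetical, and the pooling rule is an inference from the tabled numbers rather than something stated in this section.

```python
def variance_component(numerator_ms, denominator_ms, vce_denominator):
    """(sum of numerator MS - sum of denominator MS) / VCE denominator."""
    return (sum(numerator_ms) - sum(denominator_ms)) / vce_denominator

# Mean squares for Model 2, E1 (Table 6).
ms_e1 = {
    "C": 63.457, "CR": 15.978, "CT": 3.720, "CRT": 39.103,
    "J:KRT": 29.702, "JS:CKLRT": 3.294,
}

# Source 1, C: numerator C + CRT, denominator CR + CT, VCE denominator 640.
c_e1 = variance_component([ms_e1["C"], ms_e1["CRT"]], [ms_e1["CR"], ms_e1["CT"]], 640)

# Source 26, J:KRT: tested directly against the residual, VCE denominator 40.
j_e1 = variance_component([ms_e1["J:KRT"]], [ms_e1["JS:CKLRT"]], 40)

# Pooled estimate for C: average of the E1 and E2 entries in Table 9.
c_pooled = (0.1295 + 0.0781) / 2

print(f"C (E1) = {c_e1:.4f}, J:KRT (E1) = {j_e1:.4f}, C pooled = {c_pooled:.4f}")
```

Nothing in this subtraction constrains the result to be non-negative, which is consistent with the many negative estimates in the table.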









